A House United: Bridging the Script and Lexical Barrier between Hindi and Urdu

Authors

  • Riyaz Ahmad Bhat
  • Irshad Ahmad Bhat
  • Naman Jain
  • Dipti Misra Sharma
Abstract

In Computational Linguistics, Hindi and Urdu are not viewed as a monolithic entity and have received separate attention with respect to their text processing. From part-of-speech tagging to machine translation, models are trained separately for Hindi and Urdu despite the fact that they represent the same language. The main reasons are their divergent literary vocabularies and separate orthographies, and probably also their political status and the social perception that they are two separate languages. In this paper, we propose a simple but efficient approach to bridge the lexical and orthographic differences between Hindi and Urdu texts. With respect to text processing, addressing the differences between their texts would be beneficial in the following ways: (a) instead of training separate models, their individual resources can be augmented to train single, unified models for better generalization, and (b) their individual text processing applications can be used interchangeably under varied resource conditions. To remove the script barrier, we learn accurate statistical transliteration models which use sentence-level decoding to resolve word ambiguity. Similarly, we learn cross-register word embeddings from the harmonized Hindi and Urdu corpora to nullify their lexical divergences. As a proof of concept, we evaluate our approach on Hindi and Urdu dependency parsing under two scenarios: (a) resource sharing, and (b) resource augmentation. We demonstrate that a neural network-based dependency parser trained on augmented, harmonized Hindi and Urdu resources performs significantly better than the parsing models trained separately on the individual resources. We also show that we can achieve near state-of-the-art results when the parsers are used interchangeably.
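
The abstract mentions sentence-level decoding to resolve word ambiguity in transliteration. The following is only a minimal sketch of that general idea, not the authors' system: it assumes each source word comes with a list of candidate transliterations and a per-word (channel) log-probability, and that a bigram language model scores neighbouring target words. The function name `decode_sentence`, the romanized toy tokens, and all probabilities are hypothetical and chosen purely for illustration.

```python
import math


def decode_sentence(candidates, bigram_logprob):
    """Viterbi search over per-word transliteration candidates.

    candidates: one list per source word, each holding
                (target_word, channel_logprob) pairs (assumed non-empty).
    bigram_logprob: function (prev_word, word) -> log-probability.
    Returns the highest-scoring sequence of target words.
    """
    # best[w] = (score, path) of the best partial path ending in word w
    best = {"<s>": (0.0, [])}
    for options in candidates:
        new_best = {}
        for word, channel_lp in options:
            # Pick the best predecessor for this candidate word.
            score, path = max(
                (
                    (prev_score + bigram_logprob(prev, word) + channel_lp, prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[word] = (score, path + [word])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical toy data: the third source word is ambiguous, and the
# word-level model alone slightly prefers "kya" over "kiya".
toy_candidates = [
    [("usne", 0.0)],
    [("kaam", 0.0)],
    [("kya", math.log(0.6)), ("kiya", math.log(0.4))],
]

# Hypothetical bigram scores; unseen pairs get a flat default.
toy_bigrams = {("kaam", "kiya"): -1.0, ("kaam", "kya"): -5.0}


def toy_bigram_logprob(prev, word):
    return toy_bigrams.get((prev, word), -3.0)


result = decode_sentence(toy_candidates, toy_bigram_logprob)
print(result)
```

In this toy setup, choosing each word independently would select "kya" for the last position, whereas scoring the whole sentence lets the context word "kaam" tip the decision toward "kiya".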

Similar resources

Development of a Complete Urdu-Hindi Transliteration System

Hindi and Urdu are variants of the same language, but while Hindi is written in the Devnagri script from left to right, Urdu is written in a script derived from a Persian modification of Arabic script written from right to left. The difference in the two scripts has created a script wedge as majority of Urdu speaking people in Pakistan cannot read Devnagri, and similarly the majority of Hindi s...

Transliterating Urdu for a Broad-Coverage Urdu/Hindi LFG Grammar

In this paper, we present a system for transliterating the Arabic-based script of Urdu to a Roman transliteration scheme. The system is integrated into a larger system consisting of a morphology module, implemented via finite state technologies, and a computational LFG grammar of Urdu that was developed with the grammar development platform XLE (Crouch et al. 2008). Our long-term goal is to han...

Developing English-Urdu Machine Translation Via Hindi

The paper presents a strategy for deriving English to Urdu translation using English to Hindi MT system. The English-Hindi lexical database is used to collect all possible Hindi words and phrases. These are further augmented by including their morphological variations and attaching all possible postpositions. This list is used to provide mapping from Hindi to Urdu. There may be change in gender...

A First Approach Towards an Urdu WordNet

This paper reports on a first experiment with developing a lexical knowledge resource for Urdu on the basis of Hindi WordNet. Due to the structural similarity of Urdu and Hindi, we can focus on overcoming the differences in the scriptual systems of the two languages by using transliterators. Various natural language processing tools, among them a computational semantics based on the Urdu ParGra...

Computational evidence that Hindi and Urdu share a grammar but not the lexicon

Hindi and Urdu share a grammar and a basic vocabulary, but are often mutually unintelligible because they use different words in higher registers and sometimes even in quite ordinary situations. We report computational translation evidence of this unusual relationship (it differs from the usual pattern, that related languages share the advanced vocabulary and differ in the basics). We took a GF...

Publication date: 2016